This guide describes common Longhorn PVC mount failure scenarios in Kubernetes and the steps to resolve them.

Problem with multipathd service

In some cases, Longhorn fails to mount Persistent Volume Claims (PVCs) to pods in a Kubernetes cluster. This issue is typically caused by a conflict with the multipathd service, which may mistakenly identify Longhorn volumes as being in use, preventing the filesystem from being created.

The multipathd service is responsible for managing multiple paths to the same storage device. When it incorrectly identifies a Longhorn volume as being in use, it blocks the filesystem creation process, resulting in mount failures.

You might encounter the following error message in your Kubernetes environment:

Error Message:

Warning  FailedMount  12s (x6 over 28s)  kubelet  
MountVolume.MountDevice failed for volume "pvc-87285c92-26c4-40bd-842d-7f608d9db2d8": 
rpc error: code = Internal desc = format of disk "/dev/longhorn/pvc-87285c92-26c4-40bd-842d-7f608d9db2d8" failed:
type: ("ext4")
target: ("/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/1e70ad7ff7c1222b1d656429fcc03679fdfa8ed3d9ae0739e656b2e161bfc08d/globalmount")
options: ("defaults")
errcode: (exit status 1)
output: (
  mke2fs 1.46.4 (18-Aug-2021)

  /dev/longhorn/pvc-87285c92-26c4-40bd-842d-7f608d9db2d8 is apparently in use by the system; will not make a filesystem here!
)
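Before applying the fix, you can confirm the conflict. A minimal sketch: pull the device path out of the kubelet error above so you can inspect it on the node (the `grep` pattern is an assumption based on Longhorn's `/dev/longhorn/<pvc-name>` device naming).

```shell
# Extract the blocked device path from the FailedMount message shown above.
err='format of disk "/dev/longhorn/pvc-87285c92-26c4-40bd-842d-7f608d9db2d8" failed'
dev=$(echo "$err" | grep -oE '/dev/longhorn/pvc-[0-9a-f-]+')
echo "$dev"
```

On the affected node, `multipath -ll` lists every device multipathd has claimed; if the SCSI device backing this Longhorn volume appears there, the blacklist fix below applies.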

Solution

Follow these steps to resolve the issue:

Step 1: Edit the multipath.conf File

  1. Open the multipath.conf file for editing:
    vi /etc/multipath.conf
  2. Add the blacklist configuration.
    • Add the following configuration to the multipath.conf file on all nodes in the cluster:
      blacklist {
          devnode "^sd[a-z0-9]+"
      }
    • After adding the configuration, the file should look like this:
      defaults {
          user_friendly_names yes
      }
      blacklist {
          devnode "^sd[a-z0-9]+"
      }
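Longhorn volumes attach to nodes as SCSI block devices (sda, sdb1, ...), which is why the blacklist targets that name pattern. A quick sketch to sanity-check the regex against a few example device names:

```shell
# Verify that the blacklist pattern matches SCSI device names
# like those Longhorn creates (sample names, not live devices).
regex='^sd[a-z0-9]+'
for dev in sda sdb1 sdc; do
  echo "$dev" | grep -Eq "$regex" && echo "$dev: blacklisted"
done
```

Note that this pattern excludes every sd* device from multipath management; if a node also relies on multipathd for other SCSI storage, you may need to scope the blacklist more narrowly.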

Step 2: Restart the multipathd.service

After updating the multipath.conf file, restart the multipathd service on all nodes in the cluster:

systemctl restart multipathd.service

Step 3: Delete and Recreate the Affected Pods

To apply the changes and resolve the issue, delete the affected pods so that Kubernetes can recreate them with the corrected configuration:

kubectl delete pod nextgen-gw-0 nextgen-gw-redis-master-0

Problem with Longhorn filesystem corruption

  • When a Longhorn volume's filesystem is corrupted, Longhorn cannot remount the volume, and the workload that uses it fails to restart.
  • Longhorn cannot repair the filesystem automatically; you must fix the corruption manually.
    You might encounter the following error message in your Kubernetes environment:

    Error Message:
    Events: 
      Type     Reason       Age                  From     Message 
      ----     ------       ----                 ----     ------- 
      Warning  FailedMount  56s (x5809 over 8d)  kubelet  MountVolume.MountDevice failed for volume "pvc-b3ca140a-dab9-49f6-9f39-063594e58521" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 but could not correct them: fsck from util-linux 2.39.3 
    /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 contains a file system with errors, check forced. 
    /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521: Unattached inode 1555 

Solution

Follow these steps to resolve the issue:

Step 1: Identify the Node Running the Pod

Run the following command to find the node where the gateway pod is running:

kubectl get pods -o wide 

Sample Response:

root@opsramp-gateway:/home/gateway-admin# kubectl get pods -o wide 
NAME                        READY   STATUS              RESTARTS   AGE   IP           NODE              NOMINATED NODE   READINESS GATES 
nextgen-gw-0                0/3     ContainerCreating   0          12m   10.42.0.31   opsramp-gateway   <none>           <none> 
nextgen-gw-redis-master-0   1/1     Running             0          25m   10.42.0.29   opsramp-gateway   <none>           <none> 

From this output, we see that the gateway pod is running on the opsramp-gateway node.
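If you prefer to extract the node programmatically, here is a minimal sketch that parses `kubectl get pods -o wide` output; it is fed the sample output above (trimmed to the first seven columns) rather than a live cluster.

```shell
# Pull the hosting node for a pod out of `kubectl get pods -o wide` output.
out='NAME                        READY   STATUS              RESTARTS   AGE   IP           NODE
nextgen-gw-0                0/3     ContainerCreating   0          12m   10.42.0.31   opsramp-gateway
nextgen-gw-redis-master-0   1/1     Running             0          25m   10.42.0.29   opsramp-gateway'
node=$(echo "$out" | awk '$1 == "nextgen-gw-0" {print $7}')
echo "$node"
```

Against a live cluster, `kubectl get pod nextgen-gw-0 -o jsonpath='{.spec.nodeName}'` returns the same value directly.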

Step 2: Login to the node and fix the file corruption issue

Log in to the node (opsramp-gateway) where the pod is running. The corrupted filesystem is repaired with fsck:

fsck -y <file-path> 

To find the device path to pass to fsck, describe the affected pod:

kubectl describe pod nextgen-gw-0 

Sample Response:

Events: 
  Type     Reason       Age                  From     Message 
  ----     ------       ----                 ----     ------- 
  Warning  FailedMount  56s (x5809 over 8d)  kubelet  MountVolume.MountDevice failed for volume "pvc-b3ca140a-dab9-49f6-9f39-063594e58521" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 but could not correct them: fsck from util-linux 2.39.3 
/dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 contains a file system with errors, check forced. 
/dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521: Unattached inode 1555 

In this case, the file path is /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521, so run:

fsck -y /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 
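Rather than copying the device path by hand, you can derive it from the event text. A sketch, using the FailedMount message shown above (the `grep` pattern assumes Longhorn's `/dev/longhorn/pvc-<uuid>` naming):

```shell
# Extract the corrupted device path from the FailedMount event text.
event='MountVolume.MountDevice failed for volume "pvc-b3ca140a-dab9-49f6-9f39-063594e58521" : fsck found errors on device /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 but could not correct them'
dev=$(echo "$event" | grep -oE '/dev/longhorn/pvc-[0-9a-f-]+' | head -n 1)
echo "$dev"
# On the node, repair it with: fsck -y "$dev"
```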

Step 3: Delete the Affected Pods

To apply the fixes, delete the affected pod so Kubernetes can recreate it:

kubectl delete pod nextgen-gw-0 

If multiple pods are affected, repeat the deletion process for each.
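The repeated deletions can be sketched as a dry-run loop; the pod names are the examples from this guide, and dropping the `echo` would perform the actual deletions.

```shell
# Dry run: print the delete command for each affected pod.
for pod in nextgen-gw-0 nextgen-gw-redis-master-0; do
  echo kubectl delete pod "$pod"
done
```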